Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages
Abstract
We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as basic units for language model adaptation. Experiments on a set of Estonian test texts and broadcast news speech data show that lemmas and morphemes give better performance than words in all cases. In speech recognition experiments, morpheme-based adaptation is found to perform significantly better than lemma-based adaptation.
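The abstract names fast marginal adaptation but gives no formula. As a rough illustration only, the sketch below assumes the commonly cited unigram-rescaling form, in which each background probability P_bg(w | h) is multiplied by (P_adapted(w) / P_bg(w))^beta and the result is renormalized per history. The function name, the value of `beta`, and the toy probabilities are hypothetical and are not taken from the paper. In the described method the adapted unigram distribution would come from the LSA step over words, lemmas or morphemes, which is not shown here.

```python
from typing import Dict


def fast_marginal_adaptation(
    bg_cond: Dict[str, float],
    bg_unigram: Dict[str, float],
    adapted_unigram: Dict[str, float],
    beta: float = 0.5,
) -> Dict[str, float]:
    """Rescale P_bg(w | h) for one fixed history h, then renormalize.

    Assumed form (not confirmed by the abstract):
        P_adapted(w | h) = alpha(w) * P_bg(w | h) / Z(h)
        alpha(w) = (P_adapted(w) / P_bg(w)) ** beta
    """
    scaled = {
        w: p * (adapted_unigram[w] / bg_unigram[w]) ** beta
        for w, p in bg_cond.items()
    }
    z = sum(scaled.values())  # normalizer Z(h) for this history
    return {w: p / z for w, p in scaled.items()}


# Toy numbers, hypothetical and for illustration only.
bg_cond = {"ilm": 0.2, "valimised": 0.3, "ja": 0.5}          # P_bg(w | h)
bg_unigram = {"ilm": 0.1, "valimised": 0.1, "ja": 0.8}       # P_bg(w)
adapted_unigram = {"ilm": 0.05, "valimised": 0.25, "ja": 0.7}  # topic-adapted P(w)

adapted = fast_marginal_adaptation(bg_cond, bg_unigram, adapted_unigram)
```

With these toy inputs, a unit that the adaptation text makes more likely ("valimised") gains conditional probability mass, while one that becomes less likely ("ilm") loses it, and the distribution for the history still sums to one.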
Similar resources
LSA-based language model adaptation for highly inflected languages
This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, subword units are used as basic units for language modeling. Since such units carry little semantic information, they are not very suitable for topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt language mod...
LSA learner sentence comprehension in agglutinative and non-agglutinative languages
This work was carried out in the context of automatic evaluation of learner summaries, where text comprehension is gained using Latent Semantic Analysis (LSA) and Natural Language Processing (NLP) techniques. We had intuitively observed that lemmatized versions of LSA matrices resembled human Basque similarity judgements better than the non-lemmatized ones. This research was conducted to tes...
A Framework for Language Model Adaptation for Highly-Inflected Slovenian Language
This paper describes a new framework to construct topic-adapted language models for large vocabulary speech recognition of the highly inflected Slovenian language. Two important difficulties caused by the high inflectionality of Slovenian are discussed: the out-of-vocabulary rate and feature extraction for topic detection. To use the most popular language models (N-grams) and the well-known classifiers (T...
Rapid Unsupervised Topic Adaptation – a Latent Semantic Approach
In open-domain language exploitation applications, a wide variety of topics with swift topic shifts has to be captured. Consequently, it is crucial to rapidly adapt all language components of a spoken language system. This thesis addresses unsupervised topic adaptation in both monolingual and crosslingual settings. For automatic speech recognition we rapidly adapt a language model on a source l...
Topic detection for language model adaptation of highly-inflected languages by using a fuzzy comparison function
A new framework is proposed to construct corpus-based topic-adapted language models for large vocabulary speech recognition of the highly inflected Slovenian language. The proposed techniques can be applied to other Slavic languages, where words are formed by many different inflectional affixations. In this article an attempt to overcome two important difficulties of highly-inflected languages (hig...